Mystery of the Sicilian Olives: the Python Data Science Process

In []:
Get it at https://dl.dropboxusercontent.com/u/75194/BDF/Olives_At_BDF.ipynb
  1. The Data Science Process
  2. The Olives Dataset
  3. Cleaning the Data and Exploratory Viz
    1. Cleaning
    2. Exploring Globally
    3. Exploring by group
  4. Visual Exploration and Separation of Regions
  5. Visual Exploration and Separation of the South: the problem with Sicily
  6. Machine Learning, Prediction, and scikit-learn
  7. LDA and the creation of features.
  8. Where's the test data? Non-parametric, maximum margin SVM on training data for Regions
  9. A memory based model: using kNN on the South, no Sicily
    • Cross-Validation of the kNN and Learning Curves
  10. Support Vectors and Kernels
    • Cross-Validation: train/validate/test.
    • Fitting Hyperparameters and Regularization
  11. Bring back Sicily: Decision Trees and Random Forests. Cross-validation not needed.
  12. Comparing classifiers (ROC). Conclusions. Issues and the Process.
In [1]:
#!pip install seaborn
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
In [2]:
from IPython.core.display import Image

1. What is Data Science?

In [3]:
Image(url="http://qph.is.quoracdn.net/main-qimg-3504cc03d0a1581096eba9ef97cfd7eb?convert_to_webp=true")
Out[3]:

The IPython Notebook serves as a process notebook for your research, spanning from initial exploration to communicating your results. Use Git(hub) to version all the notebooks you create.

2. Italian Olives

In [4]:
Image(url="https://dl.dropboxusercontent.com/u/75194/BDF/Italy.png")
Out[4]:

I found this data set in the RGGobi book (http://www.ggobi.org/book/), from which the above diagram is taken. It has "the percentage composition of fatty acids found in the lipid fraction of Italian olive oils", with oils from 3 regions of Italy: the North, the South, and Sardinia. The regions themselves are subdivided into areas as shown in the map above. The source for this data is:

Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), Classification of Olive Oils from their Fatty Acid Composition, in Martens, H. and Russwurm Jr., H., eds, Food Research and Data Analysis, Applied Science Publishers, London, pp. 189–214.

3. Cleaning the Data and Exploratory Viz

From my friend and co-TF for CS109 (see https://github.com/cs109/content) Chris Beaumont, wonderfully expressed in:

http://nbviewer.ipython.org/github/cs109/content/blob/master/lec_04_wrangling.ipynb

"I'd like to suggest a basic rubric for the early stages of exploratory data analysis in Python. This isn't universally applicable, but it does cover many patterns which recur in several data analysis contexts. It's useful to keep this rubric in mind when encountering a new dataset."

The basic workflow is as follows:

  • Build a DataFrame from the data (ideally, put all data in this object)
  • Clean the DataFrame. It should have the following properties:

    • Each row describes a single object
    • Each column describes a property of that object
    • Columns are numeric whenever appropriate
    • Columns contain atomic properties that cannot be further decomposed
  • Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.

  • Explore group properties. Use groupby and small multiples to compare subsets of the data.

This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to followup in subsequent analysis."
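A minimal sketch of this rubric on a made-up two-row dataset (the labels and values here are hypothetical, mimicking the olive data loaded below):

```python
import pandas as pd

# Toy raw data: one row per oil sample (made-up values)
raw = pd.DataFrame({"label": ["1.North-Apulia", "2.Sicily"],
                    "palmitic": ["1075", "911"]})

# Clean: atomic, numeric columns
raw["area"] = raw.label.map(lambda s: s.split(".")[-1])
raw["palmitic"] = raw.palmitic.astype(int) / 100.0

# Explore globally, then by group
print(raw.palmitic.describe())
print(raw.groupby("area").palmitic.mean())
```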

In [5]:
import requests
r=requests.get("https://dl.dropboxusercontent.com/u/75194/BDF/olive.csv")
print r.status_code
with open("local_olive.csv","w") as f:
    f.write(r.text.encode("utf-8"))
200

In [6]:
df=pd.read_csv("local_olive.csv")
df.head(5)
Out[6]:
Unnamed: 0 region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
0 1.North-Apulia 1 1 1075 75 226 7823 672 36 60 29
1 2.North-Apulia 1 1 1088 73 224 7709 781 31 61 29
2 3.North-Apulia 1 1 911 54 246 8113 549 31 63 29
3 4.North-Apulia 1 1 966 57 240 7952 619 50 78 35
4 5.North-Apulia 1 1 1051 67 259 7771 672 50 80 46
In [7]:
type(df), type(df.region)
Out[7]:
(pandas.core.frame.DataFrame, pandas.core.series.Series)
In [8]:
Image(url="https://dl.dropboxusercontent.com/u/75194/BDF/pandastruct.png")
Out[8]:
In [9]:
df.dtypes
Out[9]:
Unnamed: 0     object
region          int64
area            int64
palmitic        int64
palmitoleic     int64
stearic         int64
oleic           int64
linoleic        int64
linolenic       int64
arachidic       int64
eicosenoic      int64
dtype: object

We look at the dtypes to see if the columns make sense.

Cleaning

Let's rename that ugly first column. A Google search for 'python pandas dataframe rename' points you at this documentation.

In [10]:
print df.columns
df.rename(columns={df.columns[0]:'areastring'}, inplace=True)
df.columns
Index([u'Unnamed: 0', u'region', u'area', u'palmitic', u'palmitoleic', u'stearic', u'oleic', u'linoleic', u'linolenic', u'arachidic', u'eicosenoic'], dtype='object')

Out[10]:
Index([u'areastring', u'region', u'area', u'palmitic', u'palmitoleic', u'stearic', u'oleic', u'linoleic', u'linolenic', u'arachidic', u'eicosenoic'], dtype='object')

Let's explore. Which unique regions and areas are contained in the dataset?

In [11]:
print 'regions\t', df.region.unique()
print 'areas\t', df.area.unique()
regions	[1 2 3]
areas	[1 2 3 4 5 6 9 7 8]

In [12]:
pd.value_counts(df.region)
Out[12]:
1    323
3    151
2     98
dtype: int64
In [13]:
pd.value_counts(df.area)
Out[13]:
3    206
5     65
2     56
9     51
8     50
7     50
4     36
6     33
1     25
dtype: int64

Let's get rid of the junk numbering in df.areastring. For a single-column Pandas Series we use map. Here's the documentation.

In [14]:
df.areastring=df.areastring.map(lambda x: x.split('.')[-1])
df.head()
Out[14]:
areastring region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
0 North-Apulia 1 1 1075 75 226 7823 672 36 60 29
1 North-Apulia 1 1 1088 73 224 7709 781 31 61 29
2 North-Apulia 1 1 911 54 246 8113 549 31 63 29
3 North-Apulia 1 1 966 57 240 7952 619 50 78 35
4 North-Apulia 1 1 1051 67 259 7771 672 50 80 46
In [15]:
rkeys=[1,2,3]
rvals=['South','Sardinia','North']
rmap={e[0]:e[1] for e in zip(rkeys,rvals)}
print rmap
df['regionstring']=df.region.map(lambda x: rmap[x])
df.head()
{1: 'South', 2: 'Sardinia', 3: 'North'}

Out[15]:
areastring region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic regionstring
0 North-Apulia 1 1 1075 75 226 7823 672 36 60 29 South
1 North-Apulia 1 1 1088 73 224 7709 781 31 61 29 South
2 North-Apulia 1 1 911 54 246 8113 549 31 63 29 South
3 North-Apulia 1 1 966 57 240 7952 619 50 78 35 South
4 North-Apulia 1 1 1051 67 259 7771 672 50 80 46 South

To access a specific subset of columns of a dataframe, we can use list indexing.

In [16]:
df[['palmitic','oleic']].head()
Out[16]:
palmitic oleic
0 1075 7823
1 1088 7709
2 911 8113
3 966 7952
4 1051 7771

The above is a DataFrame.

To access the series of entries of a single column, we could do the following.

In [17]:
print df['palmitic']
0     1075
1     1088
2      911
3      966
4     1051
5      911
6      922
7     1100
8     1082
9     1037
10    1051
11    1036
12    1074
13     875
14     952
...
557    1010
558    1020
559    1120
560    1090
561    1100
562    1090
563    1150
564    1110
565    1010
566    1070
567    1280
568    1060
569    1010
570     990
571     960
Name: palmitic, Length: 572, dtype: int64

Remember that the acid percentages were stored as ints. They need to be renormalized. Let's do this:

In [18]:
acidlist=['palmitic', 'palmitoleic', 'stearic', 'oleic', 'linoleic', 'linolenic', 'arachidic', 'eicosenoic']

#your code here

dfsub=df[acidlist].apply(lambda x: x/100.0)
dfsub.head()
Out[18]:
palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
0 10.75 0.75 2.26 78.23 6.72 0.36 0.60 0.29
1 10.88 0.73 2.24 77.09 7.81 0.31 0.61 0.29
2 9.11 0.54 2.46 81.13 5.49 0.31 0.63 0.29
3 9.66 0.57 2.40 79.52 6.19 0.50 0.78 0.35
4 10.51 0.67 2.59 77.71 6.72 0.50 0.80 0.46

Notice that we can replace part of the dataframe with this new frame. Since we need the percentages, let's do this. The oleic percentages should be in the 70s and 80s if you did this right.

In [19]:
df[acidlist]=dfsub
df.head()
Out[19]:
areastring region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic regionstring
0 North-Apulia 1 1 10.75 0.75 2.26 78.23 6.72 0.36 0.60 0.29 South
1 North-Apulia 1 1 10.88 0.73 2.24 77.09 7.81 0.31 0.61 0.29 South
2 North-Apulia 1 1 9.11 0.54 2.46 81.13 5.49 0.31 0.63 0.29 South
3 North-Apulia 1 1 9.66 0.57 2.40 79.52 6.19 0.50 0.78 0.35 South
4 North-Apulia 1 1 10.51 0.67 2.59 77.71 6.72 0.50 0.80 0.46 South
In [20]:
df.to_csv("local-olives-cleaned.csv", index=False)

Exploring Globally

In [21]:
pd.crosstab(df.areastring, df.regionstring)
Out[21]:
regionstring North Sardinia South
areastring
Calabria 0 0 56
Coast-Sardinia 0 33 0
East-Liguria 50 0 0
Inland-Sardinia 0 65 0
North-Apulia 0 0 25
Sicily 0 0 36
South-Apulia 0 0 206
Umbria 51 0 0
West-Liguria 50 0 0
In [22]:
pd.value_counts(df.areastring, sort=False).plot(kind="bar");
In [23]:
pd.value_counts(df.regionstring, sort=False).plot(kind="barh");
In [24]:
df.describe()
Out[24]:
region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
count 572.000000 572.000000 572.000000 572.000000 572.000000 572.000000 572.000000 572.000000 572.000000 572.000000
mean 1.699301 4.599650 12.317413 1.260944 2.288654 73.117483 9.805280 0.318881 0.580979 0.162815
std 0.859968 2.356687 1.685923 0.524944 0.367449 4.058102 2.427992 0.129687 0.220302 0.140833
min 1.000000 1.000000 6.100000 0.150000 1.520000 63.000000 4.480000 0.000000 0.000000 0.010000
25% 1.000000 3.000000 10.950000 0.877500 2.050000 70.000000 7.707500 0.260000 0.500000 0.020000
50% 1.000000 3.000000 12.010000 1.100000 2.230000 73.025000 10.300000 0.330000 0.610000 0.170000
75% 3.000000 7.000000 13.600000 1.692500 2.490000 76.800000 11.807500 0.402500 0.700000 0.280000
max 3.000000 9.000000 17.530000 2.800000 3.750000 84.100000 14.700000 0.740000 1.050000 0.580000
In [25]:
df[acidlist].median().plot(kind="bar");
In [26]:
df.palmitic.hist()
plt.xlabel("% of acid")
plt.ylabel("count of oils")
plt.title("Histogram of Palmitic Acid content");
In [27]:
df.palmitic.hist(bins=30, alpha=0.5);
In [28]:
sns.distplot(df.palmitic);

Exploring by Group

The first concept we deal with here is pandas groupby. The idea is to group a dataframe by the values of a particular factor variable. The documentation can be found here.

What we do here is a technique called Split, Apply, and Combine.

In [29]:
#split
region_groupby = df.groupby('region')
print type(region_groupby)
region_groupby.head()
<class 'pandas.core.groupby.DataFrameGroupBy'>

Out[29]:
areastring region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic regionstring
0 North-Apulia 1 1 10.75 0.75 2.26 78.23 6.72 0.36 0.60 0.29 South
1 North-Apulia 1 1 10.88 0.73 2.24 77.09 7.81 0.31 0.61 0.29 South
2 North-Apulia 1 1 9.11 0.54 2.46 81.13 5.49 0.31 0.63 0.29 South
3 North-Apulia 1 1 9.66 0.57 2.40 79.52 6.19 0.50 0.78 0.35 South
4 North-Apulia 1 1 10.51 0.67 2.59 77.71 6.72 0.50 0.80 0.46 South
323 Inland-Sardinia 2 5 11.29 1.20 2.22 72.72 11.12 0.43 0.98 0.02 Sardinia
324 Inland-Sardinia 2 5 10.42 1.35 2.10 73.76 11.16 0.35 0.90 0.03 Sardinia
325 Inland-Sardinia 2 5 11.03 0.96 2.10 73.80 10.85 0.32 0.94 0.03 Sardinia
326 Inland-Sardinia 2 5 11.18 0.97 2.21 72.79 11.54 0.35 0.94 0.02 Sardinia
327 Inland-Sardinia 2 5 10.52 0.95 2.15 73.88 11.26 0.31 0.92 0.01 Sardinia
421 Umbria 3 9 10.85 0.70 1.80 79.55 6.05 0.20 0.50 0.01 North
422 Umbria 3 9 10.85 0.70 1.85 79.55 6.00 0.25 0.55 0.01 North
423 Umbria 3 9 10.90 0.60 1.90 79.50 6.00 0.28 0.47 0.02 North
424 Umbria 3 9 10.80 0.65 1.89 79.60 6.02 0.35 0.20 0.01 North
425 Umbria 3 9 10.90 0.60 1.95 79.55 6.00 0.28 0.42 0.02 North

The groupby object also acts like a collection that can be mapped over. After the mapping is complete, the rows are put back together (reduced) into a larger dataframe: for example, using the describe function.

In [30]:
region_groupby.describe()
Out[30]:
region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
region
1 count 323 323.000000 323.000000 323.000000 323.000000 323.000000 323.000000 323.000000 323.000000 323.000000
mean 1 2.783282 13.322879 1.548019 2.287740 71.000093 10.334985 0.380650 0.631176 0.273220
std 0 0.741054 1.529349 0.507237 0.398709 3.451431 2.106730 0.079727 0.111644 0.083915
min 1 1.000000 8.750000 0.350000 1.520000 63.000000 4.480000 0.200000 0.320000 0.100000
25% 1 2.500000 12.680000 1.215000 2.015000 68.830000 8.555000 0.320000 0.560000 0.220000
50% 1 3.000000 13.460000 1.630000 2.230000 70.300000 10.900000 0.370000 0.620000 0.270000
75% 1 3.000000 14.190000 1.850000 2.495000 72.835000 12.025000 0.440000 0.690000 0.320000
max 1 4.000000 17.530000 2.800000 3.750000 81.130000 14.620000 0.740000 1.020000 0.580000
2 count 98 98.000000 98.000000 98.000000 98.000000 98.000000 98.000000 98.000000 98.000000 98.000000
mean 2 5.336735 11.113469 0.967449 2.261837 72.680204 11.965306 0.270918 0.731735 0.019388
std 0 0.475023 0.404111 0.138514 0.176363 1.418783 1.072336 0.053844 0.118826 0.007436
min 2 5.000000 10.300000 0.350000 1.990000 68.820000 10.570000 0.150000 0.450000 0.010000
25% 2 5.000000 10.852500 0.882500 2.120000 71.372500 11.122500 0.230000 0.660000 0.010000
50% 2 5.000000 11.075000 0.960000 2.220000 73.255000 11.465000 0.270000 0.720000 0.020000
75% 2 6.000000 11.372500 1.040000 2.395000 73.810000 13.065000 0.300000 0.810000 0.020000
max 2 6.000000 12.130000 1.350000 2.720000 74.390000 14.700000 0.430000 1.050000 0.030000
3 count 151 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000
mean 3 8.006623 10.948013 0.837351 2.308013 77.930530 7.270331 0.217881 0.375762 0.019735
std 0 0.820542 0.825635 0.264388 0.389560 1.648155 1.431226 0.168865 0.293586 0.007298
min 3 7.000000 6.100000 0.150000 1.700000 73.400000 5.100000 0.000000 0.000000 0.010000
25% 3 7.000000 10.600000 0.690000 2.000000 76.800000 6.020000 0.100000 0.100000 0.010000
50% 3 8.000000 10.900000 0.800000 2.300000 78.000000 6.800000 0.200000 0.380000 0.020000
75% 3 9.000000 11.250000 1.000000 2.500000 79.500000 8.250000 0.350000 0.595000 0.025000
max 3 9.000000 14.000000 1.800000 3.500000 84.100000 10.500000 0.700000 1.000000 0.030000

One can apply bulk functions to the regional group dataframes or series:

In [31]:
region_groupby.eicosenoic.mean()
Out[31]:
region
1         0.273220
2         0.019388
3         0.019735
Name: eicosenoic, dtype: float64
In [32]:
region_groupby.mean()
Out[32]:
area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
region
1 2.783282 13.322879 1.548019 2.287740 71.000093 10.334985 0.380650 0.631176 0.273220
2 5.336735 11.113469 0.967449 2.261837 72.680204 11.965306 0.270918 0.731735 0.019388
3 8.006623 10.948013 0.837351 2.308013 77.930530 7.270331 0.217881 0.375762 0.019735

Or one can use aggregate to pass an arbitrary function to each sub-dataframe. The function is applied columnwise.

In [33]:
dfbymean=df.groupby("regionstring").aggregate(np.mean)
dfbymean.head()
Out[33]:
region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
regionstring
North 3 8.006623 10.948013 0.837351 2.308013 77.930530 7.270331 0.217881 0.375762 0.019735
Sardinia 2 5.336735 11.113469 0.967449 2.261837 72.680204 11.965306 0.270918 0.731735 0.019388
South 1 2.783282 13.322879 1.548019 2.287740 71.000093 10.334985 0.380650 0.631176 0.273220
In [34]:
with sns.axes_style("white", {'grid':False}):
    dfbymean[acidlist].plot(kind='barh', stacked=True);
    sns.despine()

Or one can use apply to pass an arbitrary function to the sub-dataframe. This one takes the dataframe as argument.

In [35]:
region_groupby.apply(lambda f: f.palmitic.mean())
Out[35]:
region
1         13.322879
2         11.113469
3         10.948013
dtype: float64

4. Figuring the dataset by Region

In [36]:
g=sns.FacetGrid(df, col="region")
g.map(plt.scatter,"eicosenoic", "linoleic");

Clearly, region 1, the South, can be visually separated out by the eicosenoic fraction alone.

In [37]:
with sns.axes_style("white"):
    g=sns.FacetGrid(df, col="region")
    g.map(sns.distplot, "eicosenoic")

We make a SPLOM using seaborn to see in what space the regions may be separated. Note that linoleic and oleic seem promising. And perhaps arachidic paired with eicosenoic.

In [38]:
sns.pairplot(df, vars=acidlist, hue="regionstring", size=2.5, diag_kind='kde');

Pandas supports conditional indexing: documentation. Let's use it to follow up on the clear pattern of Southern oils seeming to be separable by just the eicosenoic feature.

In [39]:
loweico=df[df.eicosenoic < 0.02]
pd.crosstab(loweico.areastring, loweico.regionstring)
Out[39]:
regionstring North Sardinia
areastring
Coast-Sardinia 0 11
East-Liguria 17 0
Inland-Sardinia 0 19
Umbria 14 0
West-Liguria 11 0

Indeed this is the case! It can also be seen using parallel coordinates:

In [40]:
from pandas.tools.plotting import parallel_coordinates
dfna=df[acidlist]
#normalizing by range
dfna_norm = (dfna - dfna.mean()) / (dfna.max() - dfna.min())
with sns.axes_style("white"):
    parallel_coordinates(df[['regionstring']].join(dfna_norm), 'regionstring', alpha=0.3)

5. Figuring the South of Italy by Area

In [41]:
dfsouth=df[df.regionstring=='South']
dfsouth.head()
Out[41]:
areastring region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic regionstring
0 North-Apulia 1 1 10.75 0.75 2.26 78.23 6.72 0.36 0.60 0.29 South
1 North-Apulia 1 1 10.88 0.73 2.24 77.09 7.81 0.31 0.61 0.29 South
2 North-Apulia 1 1 9.11 0.54 2.46 81.13 5.49 0.31 0.63 0.29 South
3 North-Apulia 1 1 9.66 0.57 2.40 79.52 6.19 0.50 0.78 0.35 South
4 North-Apulia 1 1 10.51 0.67 2.59 77.71 6.72 0.50 0.80 0.46 South

We make a couple of SPLOMs, one with Sicily and one without, to see what's separable. Sicily seems to be a problem. As before, look at the KDEs first to see if separability exists, and then let the eye look for patterns.

In [42]:
sns.pairplot(dfsouth, hue="areastring", size=2.5, vars=acidlist, diag_kind='kde');
In [43]:
sns.pairplot(dfsouth[dfsouth.areastring!="Sicily"], hue="areastring", size=2.5, vars=acidlist, diag_kind='kde');

It seems that combinations of oleic, palmitic, and palmitoleic might be useful.

6. Machine Learning, Prediction, and scikit-learn

So let's do some prediction. After all, we want to test for purity of oils, and for climatic effects.

The usual way we do prediction is to minimize a loss function. This is the most general formulation. These minimizations can come out of generative models like LDA, discriminative models like Logistic Regression, or discriminant models like SVM which have only a cost function but no probabilistic model behind them.

Models may be parametric, like logistic regression where we fit for regression coefficients, or non-parametric, like kNN or decision trees. Of those two, the former (kNN) is a memory-based model, where you need to keep the data set in memory to make predictions. Sometimes you keep inner products of points in memory, in the so-called kernelized algorithms.

First, a bit about the structure of scikit-learn:

Some of the following text is taken from the scikit-learn API paper: http://arxiv.org/pdf/1309.0238v1.pdf

All objects within scikit-learn share a uniform common basic API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions and a transformer interface for converting data. The estimator interface is at the core of the library. It defines instantiation mechanisms of objects and exposes a fit method for learning a model from training data. All supervised and unsupervised learning algorithms (e.g., for classification, regression or clustering) are offered as objects implementing this interface. Machine learning tasks like feature extraction, feature selection or dimensionality reduction are also provided as estimators.

An example along these lines:

clf = LogisticRegression()
clf.fit(X_train, y_train)

If one changes classifiers, say, to a Random Forest classifier, one would simply replace LogisticRegression() in the snippet above by RandomForestClassifier().

The predictor interface extends the notion of an estimator by adding a predict method that takes an array X_test and produces predictions for X_test, based on the learned parameters of the estimator. In the case of supervised learning estimators, this method typically returns the predicted labels or values computed by the model. Some unsupervised learning estimators may also implement the predict interface, such as k-means, where the predicted values are the cluster labels.

clf.predict(X_test)

Since it is common to modify or filter data before feeding it to a learning algorithm, some estimators in the library implement a transformer interface which defines a transform method. It takes as input some new data X_test and yields as output a transformed version of X_test. Preprocessing, feature selection, feature extraction and dimensionality reduction algorithms are all provided as transformers within the library.

This is usually done via the fit_transform method. For example, to do a PCA:

pca = RandomizedPCA(n_components=2)
train_x = pca.fit_transform(train_x)
test_x = pca.transform(test_x)

The training set here is "fit" to find the principal components, and then transformed. Since pca.fit() by itself changes the pca object, if we want to transform other data using the same transformation we simply call transform subsequently.
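The same fit-then-transform pattern can be sketched with a plain standardizer in place of the PCA (a hypothetical stand-in, not scikit-learn's class):

```python
import numpy as np

class Standardizer(object):
    """Mimics the transformer interface: fit learns parameters
    from the training data, transform applies them."""
    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self
    def transform(self, X):
        # uses the *training* mean/std, even on new data
        return (X - self.mean_) / self.std_
    def fit_transform(self, X):
        return self.fit(X).transform(X)

train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[2.0, 20.0]])
scaler = Standardizer()
train_s = scaler.fit_transform(train)  # fit on training data only
test_s = scaler.transform(test)        # same transformation, reused
```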

From: http://nbviewer.ipython.org/urls/raw.github.com/jakevdp/sklearn_scipy2013/master/rendered_notebooks/02.1_representation_of_data.ipynb

Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays, or in some cases scipy.sparse matrices. The size of the array is expected to be [n_samples, n_features]
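A tiny sketch of that layout (hypothetical values, column names mimicking the olive data):

```python
import pandas as pd

# Toy frame with two feature columns (made-up values)
df_toy = pd.DataFrame({"palmitic": [10.75, 10.88, 9.11],
                       "oleic": [78.23, 77.09, 81.13]})

# .values gives the [n_samples, n_features] array sklearn expects
X = df_toy[["palmitic", "oleic"]].values
print(X.shape)  # (n_samples, n_features)
```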

7. Dimensional reduction, and the creation of features.

In [44]:
#some code taken from https://github.com/jakevdp/ESAC-stats-2014
from IPython.html.widgets import interact
from mpl_toolkits import mplot3d

def plot_3D(X, y, z, elev=30, azim=30):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], z, c=y, s=50)
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('eicosenoic')
    ax.set_ylabel('linoleic')
    ax.set_zlabel('arachidic')

#set azim to -77, elev to 24
def plotter(elev, azim):
    return plot_3D(df[['eicosenoic', 'linoleic']].values, df.region.values, df['arachidic'].values, elev=elev, azim=azim)
interact(plotter, elev=[-90, 90], azim=(-180, 180));
In [45]:
from sklearn.lda import LDA
X=df[acidlist]
y=df.region
sklearn_lda = LDA(n_components=2)
X_lda_sklearn = sklearn_lda.fit_transform(X, y)

plt.scatter(X_lda_sklearn[:, 0], X_lda_sklearn[:, 1], c=y, s=50)
plt.xlabel("LDA1");
plt.ylabel("LDA2");
In [46]:
dfnew=df[['eicosenoic', 'region', 'regionstring']]
dfnew['linoarch']=(0.969/1022.0)*df.linoleic + (0.245/105.0)*df.arachidic
dfnew.head()
Out[46]:
eicosenoic region regionstring linoarch
0 0.29 1 South 0.007772
1 0.29 1 South 0.008828
2 0.29 1 South 0.006675
3 0.35 1 South 0.007689
4 0.46 1 South 0.008238
In [47]:
plt.scatter(dfnew.linoarch, dfnew.eicosenoic, c=dfnew.region, s=50)
plt.xlabel("linoarch")
plt.ylabel("eicosenoic");
In [48]:
def standardize(l):
    return (l - l.mean()) / l.std()
plt.scatter(standardize(dfnew.linoarch), standardize(dfnew.eicosenoic), c=dfnew.region, s=50);
In [49]:
plt.scatter(standardize(df.linoleic), standardize(df.eicosenoic), c=df.region, s=50)
Out[49]:
<matplotlib.collections.PathCollection at 0x111ba1dd0>
In [50]:
dfnosouth=df[df.regionstring!='South']
dfnosouth.head()
Out[50]:
areastring region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic regionstring
323 Inland-Sardinia 2 5 11.29 1.20 2.22 72.72 11.12 0.43 0.98 0.02 Sardinia
324 Inland-Sardinia 2 5 10.42 1.35 2.10 73.76 11.16 0.35 0.90 0.03 Sardinia
325 Inland-Sardinia 2 5 11.03 0.96 2.10 73.80 10.85 0.32 0.94 0.03 Sardinia
326 Inland-Sardinia 2 5 11.18 0.97 2.21 72.79 11.54 0.35 0.94 0.02 Sardinia
327 Inland-Sardinia 2 5 10.52 0.95 2.15 73.88 11.26 0.31 0.92 0.01 Sardinia
In [51]:
plt.scatter(standardize(dfnosouth.linoleic), standardize(dfnosouth.arachidic), c=dfnosouth.region, s=50);

8. Where's the test data? Non-parametric, maximum margin SVM on training data for Regions

From Jesse Johnson's excellent Shape of Data: http://shapeofdata.wordpress.com/2013/05/14/linear-separation-and-support-vector-machines/

(figure: max margin)

The idea is to draw a line in space between the classes. Not just any line, though, but the line which gives the maximum margin between points of the different classes.

from http://nlp.stanford.edu/IR-book/html/htmledition/support-vector-machines-the-linearly-separable-case-1.html : support vectors

The points right at the boundary are called the support vectors, which is where the name comes from.

But what if the separability is not so simple, and there are points intruding?

(figure: intrusion)

Then the idea is to minimize the distance of the "crossed over" points from the separating line. These crossed-over points are penalized using "slack" variables. You don't want too many of these.

You obtain the line by minimizing the hinge loss.
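As a sketch, the hinge loss for labels coded as ±1 (this encoding is an assumption, not taken from the text above):

```python
import numpy as np

def hinge_loss(y, scores):
    """Hinge loss: zero for points classified beyond the margin
    (y * score >= 1), linear penalty for points inside or across it."""
    return np.maximum(0.0, 1.0 - y * scores)

y = np.array([1, 1, -1])             # true labels, coded as +/-1
scores = np.array([2.0, 0.5, -0.5])  # signed distances from the line
print(hinge_loss(y, scores))         # beyond margin -> 0, inside -> 0.5
```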

In [52]:
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix
#Stolen from Jake's notebooks, above: https://github.com/jakevdp/ESAC-stats-2014
from sklearn.datasets.samples_generator import make_blobs
from sklearn.svm import SVC # "Support Vector Classifier"

def plot_svc_decision_function(clf, ax=None):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
    y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
    Y, X = np.meshgrid(y, x)
    P = np.zeros_like(X)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            P[i, j] = clf.decision_function([xi, yj])
    return ax.contour(X, Y, P, colors='k',
                      levels=[-1, 0, 1], alpha=0.5,
                      linestyles=['--', '-', '--'])

def plot_svm(N):
    X, y = make_blobs(n_samples=200, centers=2,
                      random_state=0, cluster_std=0.60)
    X = X[:N]
    y = y[:N]
    clf = SVC(kernel='linear')
    clf.fit(X, y)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')
    plt.xlim(-1, 4)
    plt.ylim(-1, 6)
    plot_svc_decision_function(clf, plt.gca())
    plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
                s=200, facecolors='none')
    
interact(plot_svm, N=[10, 200], kernel='linear');

Notice how the points that mainly matter are the ones near the margin: the support vectors. If new such points come in, the prediction can change.

Test and Training sets

So far we have been progressing with the entire data set. This is crazy. We have nothing to test on, and further, we might be particularizing to the peculiarities of our sample!

In [53]:
X = dfnosouth[['linoleic', 'arachidic']]
Xstd= (X -X.mean())/X.std()
y = (dfnosouth.regionstring.values=='Sardinia')*1
Xtrain, Xtest, ytrain, ytest = train_test_split(Xstd.values,y)
clf = SVC(kernel="linear")
clf.fit(Xtrain, ytrain)
plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=ytrain, s=50, cmap='spring', alpha=0.3)
plot_svc_decision_function(clf, plt.gca())
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
                s=200, facecolors='none')
plt.scatter(Xtest[:, 0], Xtest[:, 1], c=ytest, s=50, marker="s", cmap='spring', alpha=0.5);
In [54]:
clf.score(Xtest, ytest)
Out[54]:
1.0

Often in SVMs one uses the kernel trick, which maps a lower dimension to a higher one to make things separable.

(See the figure from the above-mentioned book.)

So let's see what using a radial Gaussian (RBF) kernel looks like:

$K(x_1, x_2) = e^{-\gamma \, d(x_1, x_2)^2}$
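A minimal sketch of that kernel as a function (the helper name is made up):

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    """Radial (Gaussian) kernel: similarity decays exponentially
    with the squared distance between the two points."""
    d2 = np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)
    return np.exp(-gamma * d2)

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))  # far apart -> near 0
```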

In [55]:
X = dfnosouth[['linoleic', 'arachidic']]
Xstd= (X -X.mean())/X.std()
y = (dfnosouth.regionstring.values=='Sardinia')*1
Xtrain, Xtest, ytrain, ytest = train_test_split(Xstd.values,y)
clf = SVC()
clf.fit(Xtrain, ytrain)
plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=ytrain, s=50, cmap='spring', alpha=0.3)
plot_svc_decision_function(clf, plt.gca())
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
                s=200, facecolors='none')
plt.scatter(Xtest[:, 0], Xtest[:, 1], c=ytest, s=50, marker="s", cmap='spring', alpha=0.5);
In [56]:
clf
Out[56]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [57]:
confusion_matrix(clf.predict(Xtest), ytest)
Out[57]:
array([[31,  0],
       [ 0, 32]])

9. A memory based model: using kNN on the South, no Sicily

The SVM is a non-parametric, cost-function-minimization based model. But why not consider a simpler, voting-based model? kNN is an algorithm where, for every point, we find its k nearest neighbors. We then ask them: what are you? We assign to the point the value we get from the majority vote of the neighbors.

The key thing to take from here is the locality of the classification.
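A bare-bones sketch of that majority vote (a from-scratch illustration, not the scikit-learn classifier used below):

```python
import numpy as np
from collections import Counter

def knn_predict(Xtrain, ytrain, x, k=3):
    """Classify x by majority vote among its k nearest
    training points (Euclidean distance)."""
    dists = np.sqrt(((Xtrain - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]       # indices of the k closest points
    votes = Counter(ytrain[nearest])
    return votes.most_common(1)[0][0]

Xtrain = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])
ytrain = np.array([0, 0, 1, 1])
print(knn_predict(Xtrain, ytrain, np.array([0.05, 0.0]), k=3))
```

Note how local this is: only the k closest points get a say.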

In [58]:
from matplotlib.colors import ListedColormap

cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

def points_plot2(X, Xtr, Xte, ytr, yte, clf, colorscale=cmap_light, cdiscrete=cmap_bold):
    h = .02
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))

    plt.figure(figsize=(10,6))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=0.2)
    plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr-1, cmap=cdiscrete, s=50, alpha=0.2,edgecolor="k")
    # and testing points
    yact=clf.predict(Xte)
    plt.scatter(Xte[:, 0], Xte[:, 1], c=yte-1, cmap=cdiscrete, alpha=0.5, marker="s", s=35)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    return plt.gca()
In [59]:
def classify_tt(indf, inacidlist, clon, clf, train_size=0.6):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=indf[clon].values
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=train_size)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data:     %0.2f" % (test_accuracy)
    Xtr=np.concatenate((Xtrain, Xtest))
    if len(inacidlist) ==2 :
        print "making plot"
        points_plot2(Xtr, Xtrain, Xtest, ytrain, ytest, clf)
    return clf, Xtest, ytest
In [60]:
dfsouthns=dfsouth[dfsouth.areastring!="Sicily"]
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(20, warn_on_equidistant=False)
clf, Xtest, ytest=classify_tt(dfsouthns, ['palmitic','palmitoleic'],'area',clf)
Accuracy on training data: 0.94
Accuracy on test data:     0.94
making plot

At high k, you are in the bias regime, as you are averaging too many neighbors. At low k you are in the variance regime, as you are very sensitive to local effects. You will overfit in this regime, as you are basically capturing local details rather than any model. We see this tension between locality and bias play out in all classifiers.

In [61]:
def class_with_size(K,T):
    clf = KNeighborsClassifier(K, warn_on_equidistant=False)
    clf, Xtest, ytest=classify_tt(dfsouthns, ['palmitic','palmitoleic'],'area',clf, T)
interact(class_with_size, K=[1,40], T=[0.1,1.0]);
Accuracy on training data: 0.96
Accuracy on test data:     0.91
making plot

Bias and Variance

What are we doing in finding all these fits?

The bottom line is: finding the model which has an appropriate mix of bias and variance. We usually want to sit at the point of the tradeoff between the two: be simple but no simpler than necessary.

We do not want a model with too much variance: it would not generalize well. This phenomenon is also called overfitting. There is no point doing prediction if we can't generalize well. At the same time, if we have too much bias in our model, we will systematically underpredict or overpredict values and miss most predictions. This is also known as underfitting.

Cross-Validation provides us a way to find the "hyperparameters" of our model, such that we achieve the balance point.

Read http://scott.fortmann-roe.com/docs/BiasVariance.html

How to tell that a hypothesis is overfitting? It's not enough that the training error is low, though that's certainly an indication.

The training error is low but test error is high!

If we plot training error against, say, complexity d, the training error will decrease with increasing d. But the cross-validation (or, for that matter, test) error curve will have a minimum and then go up again. Here the complexity is the inverse of k.

polynomial regression
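This U-shaped test-error curve is easy to reproduce on synthetic data. A minimal sketch, assuming a recent scikit-learn (the `validation_curve` helper lives in `sklearn.model_selection`, not the older modules this notebook imports), with complexity again being 1/k:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Made-up two-feature data standing in for the two acids
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

ks = np.arange(1, 60, 2)   # complexity is 1/k: small k = wiggly, large k = smooth
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y, param_name="n_neighbors",
    param_range=ks, cv=5)

train_err = 1 - train_scores.mean(axis=1)
test_err = 1 - test_scores.mean(axis=1)

# Training error rises as k grows; the cross-validation error typically
# dips at some intermediate k and then rises again.
best_k = ks[np.argmin(test_err)]
print("lowest cv error at k =", best_k)
```

Plotting `train_err` and `test_err` against `ks` reproduces the two curves described above.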

In [62]:
def get_stats(indf, inacidlist, clon):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=indf[clon].values
    print y.shape
    numk=40
    numt=15
    numberOfRows=numk*numt
    df=pd.DataFrame(index=np.arange(0, numberOfRows), columns=('knearest', 'trainsize', 'train_accuracy', 'test_accuracy') )
    counter=0
    for tsize in np.arange(0.2, 0.95, 0.05):
        for k in range(1,numk+1):
            #print k, tsize
            tra=[]
            tea=[]
            for n in range(30):
                Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=tsize)
                clf = KNeighborsClassifier(k, warn_on_equidistant=False)
                clf=clf.fit(Xtrain, ytrain)
                tra.append(clf.score(Xtrain, ytrain))
                tea.append(clf.score(Xtest, ytest))
            training_accuracy=np.mean(tra)
            test_accuracy=np.mean(tea)
            df.loc[counter] = [k, tsize, training_accuracy, test_accuracy]
            counter=counter+1
    return df
In [63]:
dfknn = get_stats(dfsouthns, ['palmitic','palmitoleic'],'area')
dft65=dfknn[(0.63 < dfknn.trainsize) & (dfknn.trainsize < 0.67)]
dft65.head()
(287,)

Out[63]:
knearest trainsize train_accuracy test_accuracy
360 1 0.65 0.9983871 0.9082508
361 2 0.65 0.9501792 0.8937294
362 3 0.65 0.9503584 0.9237624
363 4 0.65 0.9422939 0.9290429
364 5 0.65 0.9410394 0.9267327
In [64]:
plt.plot(dft65.knearest, 1.-dft65.train_accuracy)
plt.plot(dft65.knearest, 1.-dft65.test_accuracy)
Out[64]:
[<matplotlib.lines.Line2D at 0x11087e2d0>]

Learning Curves and how to choose a model

Here we plot the train vs cv/test error as a function of the size of the training set.

The training set error increases as the size of the data set increases. The intuition is that with more samples, you get further away from the interpolation limit. The cross-validation error, on the other hand, will decrease as the training set size increases: the more data you have, the better the hypothesis you can fit.

High Bias

Now consider the high bias situation. The training error will increase as before, to a point, and then flatten out. (There is only so much you can do to make a straight line fit a quadratic curve). The cv/test error, on the other hand will decrease, but then, it too will flatten out. These will be very close to each other, and after a point, getting more training data will not help!

Learning Curve under high bias situation

Next consider the high variance situation. The training error will start out very low as usual and rise slowly: even as we add points, we have enough wiggle room to absorb them, until the wiggle room runs out and the error keeps increasing. The cv error, on the other hand, will start out quite high and remain high, so we will have a gap. In this case it makes sense to take more data, as that would drive the cv error down and the training error up, until they meet.

Learning Curve under high variance situation
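Recent versions of scikit-learn can generate these curves directly via `learning_curve` (in `sklearn.model_selection`, unlike the older imports used in this notebook). A minimal sketch on made-up nonlinear data, contrasting a high-bias linear model with a high-variance 1-NN model:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Made-up nonlinear two-class data
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

gaps = {}
for name, est in [("high-bias (linear)", LogisticRegression()),
                  ("high-variance (1-NN)", KNeighborsClassifier(n_neighbors=1))]:
    sizes, tr, te = learning_curve(est, X, y, cv=5,
                                   train_sizes=np.linspace(0.1, 1.0, 8))
    # gap between train and cv error at the largest training size:
    # small for the biased model, large for the variance-prone one
    gaps[name] = tr.mean(axis=1)[-1] - te.mean(axis=1)[-1]
    print("%-22s gap at full size: %.3f" % (name, gaps[name]))
```

Plotting `1 - tr.mean(axis=1)` and `1 - te.mean(axis=1)` against `sizes` gives exactly the two learning-curve shapes sketched above.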

In [65]:
dfk40=dfknn[dfknn.knearest==40]
plt.plot(dfk40.trainsize, 1.-dfk40.train_accuracy)
plt.plot(dfk40.trainsize, 1.-dfk40.test_accuracy)
Out[65]:
[<matplotlib.lines.Line2D at 0x10e9626d0>]
In [66]:
dfk2=dfknn[dfknn.knearest==2]
plt.plot(dfk2.trainsize, 1.-dfk2.train_accuracy)
plt.plot(dfk2.trainsize, 1.-dfk2.test_accuracy)
Out[66]:
[<matplotlib.lines.Line2D at 0x10ae23a50>]

Cross Validation

It's not enough to train on one training set and test on another. Suppose you randomly ended up with a peculiar sample? Furthermore, most models have other parameters which need to be trained. For example, kNN has the number of neighbors. The SVM has a regularization parameter 1/C and, if the kernel is radial, the bandwidth γ.

In scikit-learn, there is the concept of a meta-estimator, which behaves quite similarly to standard estimators, but allows us to wrap, for example, cross-validation, or methods that build and combine simpler models or schemes. For example:

from sklearn.multiclass import OneVsOneClassifier
clf=OneVsOneClassifier(LogisticRegression())

In scikit-learn, model selection is supported in two distinct meta-estimators, GridSearchCV and RandomizedSearchCV. They take as input an estimator (basic or composite), whose hyper-parameters must be optimized, and a set of hyperparameter settings to search through.

We now actually go ahead and do the train/test split. Not once but multiple times, in a grid search over different values of C, where C is our complexity parameter(s). For each C, we:

  1. create n_folds folds.
  2. We then train on n_folds - 1 of these folds, and test on the remaining fold.
  3. We average the results over all such combinations.
  4. We move on to the next value of C, and find the value with the best cross-validation score.
  5. We finally use that value to make the final fit.
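These steps can be sketched by hand with `KFold` (a `sklearn.model_selection` import in recent scikit-learn; the data here is synthetic, and the "C" being tuned is kNN's k):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)        # step 1: the folds

mean_scores = {}
for k in [1, 5, 15, 40]:                                    # the grid of values
    fold_scores = []
    for train_idx, test_idx in kf.split(X):                 # step 2: train on 4 folds,
        clf = KNeighborsClassifier(n_neighbors=k)           #         test on the 5th
        clf.fit(X[train_idx], y[train_idx])
        fold_scores.append(clf.score(X[test_idx], y[test_idx]))
    mean_scores[k] = np.mean(fold_scores)                   # step 3: average

best_k = max(mean_scores, key=mean_scores.get)              # step 4: best value
final_clf = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)  # step 5: final fit
print("best k:", best_k)
```

`GridSearchCV` wraps exactly this loop, which is why `cv_optimize` below only has to hand it an estimator and a parameter grid.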

Notice the structure of the GridSearchCV estimator in cv_optimize.

In [67]:
from sklearn.grid_search import GridSearchCV

def cv_optimize_knn(X, y, n_folds=10):
    clf = KNeighborsClassifier()
    parameters = {"n_neighbors": range(1,60,1)}
    gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(X, y)
    return gs
In [68]:
def get_optim_classifier_knn(indf, inacidlist, clon):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=indf[clon].values
    #Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.95)
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.8)
    fitted=cv_optimize_knn(Xtrain, ytrain)
    #fitted=cv_optimize(X, y)
    return fitted, Xtest, ytest
In [69]:
thefit, Xte, yte = get_optim_classifier_knn(dfsouthns, ['palmitic','palmitoleic'],'area')
In [70]:
thefit.best_estimator_, thefit.best_params_, thefit.best_score_
Out[70]:
(KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            metric_params=None, n_neighbors=13, p=2, weights='uniform'),
 {'n_neighbors': 13},
 0.95196506550218341)
In [71]:
thefit.score(Xte, yte)
Out[71]:
0.87931034482758619
In [72]:
plt.scatter([e[0]['n_neighbors'] for e in thefit.grid_scores_],[e[1] for e in thefit.grid_scores_])
Out[72]:
<matplotlib.collections.PathCollection at 0x110804890>

Classification with SVM and regularization

We alluded earlier to the cost function for SVM and talked about the regularization parameter C.

The loss looks like

$$C \times \mathrm{MisclassificationPenalty} + \mathrm{ComplexityPenalty}$$

If C is large, the complexity penalty, usually $\|w\|^2$, is not relevant, and we get a hard margin classifier which is prone to overfitting, as it is sensitive to outliers. If C is small, the complexity penalty dominates and we get a larger, softer margin. The misclassification penalty is the hinge loss

$$\max(0,\, 1 - y_i(w \cdot x_i + b))$$

The actual objective function that soft margin SVM tries to minimize is

$$\frac{1}{2}\|w\|^2 + C \sum_i \max(0,\, 1 - y_i(w \cdot x_i + b))$$

However, for hard margin SVM, the whole objective function is just

$$\frac{1}{2}\|w\|^2$$

since every point satisfies $y_i(w \cdot x_i + b) \ge 1$, so each max term becomes 0.
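A quick numeric check of the soft-margin objective (the `w`, `b`, and points below are made up for illustration):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b)), with y_i in {-1, +1}."""
    margins = y * (X.dot(w) + b)
    hinge = np.maximum(0, 1 - margins)
    return 0.5 * w.dot(w) + C * hinge.sum()

w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0],    # margin  2.0: outside the margin, no penalty
              [0.5, 0.0],    # margin  0.5: inside the margin, penalty 0.5
              [-1.0, 0.0]])  # margin -1.0: misclassified, penalty 2
y = np.array([1, 1, 1])

print(soft_margin_objective(w, b, X, y, C=1.0))   # 0.5*2 + (0 + 0.5 + 2) = 3.5
```

With C = 0 only the complexity term $\frac{1}{2}\|w\|^2$ survives, and if every margin is at least 1 the hinge sum vanishes, which is the hard-margin case.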

In [73]:
def cv_optimize_svm(X, y, n_folds=10, num_p=50):
    #clf = SVC()
    #parameters = {"C": np.logspace(-4, 3, num=num_p), "gamma": np.logspace(-4, 3, num=10)}
    clf = SVC(kernel="linear", probability=True)
    parameters = {"C": np.logspace(-4, 3, num=num_p)}
    gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(X, y)
    return gs

def get_optim_classifier_svm(indf, inacidlist, clon, clonval):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=(indf[clon].values==clonval)*1
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.8)
    #Xtrain, Xtest, ytrain, ytest=X,X,y,y
    fitted=cv_optimize_svm(Xtrain, ytrain)
    return fitted, Xtrain, ytrain, Xtest, ytest
In [74]:
thesvcfit, Xtr, ytr, Xte, yte = get_optim_classifier_svm(dfnosouth, ['linoleic','arachidic'],'regionstring', "Sardinia")
#thesvcfit, Xtr, ytr, Xte, yte = get_optim_classifier_binary(dfsouthns, ['palmitic','palmitoleic'],'area', 3)
thesvcfit.best_estimator_, thesvcfit.best_params_, thesvcfit.best_score_
Out[74]:
(SVC(C=0.071968567300115138, cache_size=200, class_weight=None, coef0=0.0,
   degree=3, gamma=0.0, kernel='linear', max_iter=-1, probability=True,
   random_state=None, shrinking=True, tol=0.001, verbose=False),
 {'C': 0.071968567300115138},
 1.0)
In [75]:
def plot_svm_new(clf,Xtr,ytr,Xte,yte):
    plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr, s=50, cmap='spring')
    plt.scatter(Xte[:, 0], Xte[:, 1], marker='s', c=yte, s=50, cmap='spring')

    #plt.xlim(-1, 4)
    #plt.ylim(-1, 6)
    plot_svc_decision_function(clf, plt.gca())
    plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
                s=200, facecolors='none')
print dict(kernel="linear",**thesvcfit.best_params_)
clsvc=SVC(**dict(kernel="linear",**thesvcfit.best_params_)).fit(Xtr, ytr)
plot_svm_new(clsvc, Xtr, ytr, Xte, yte)
{'kernel': 'linear', 'C': 0.071968567300115138}

The best fit (a small C) allows for a bigger margin by penalizing points inside the margin only lightly. If we use scikit-learn's default C=1, misclassification is penalized more heavily and the margin shrinks.

In [76]:
#X = dfsouthns[['palmitic', 'palmitoleic']]
#y = (dfsouthns.area.values==3)*1
X=dfnosouth[['linoleic', 'arachidic']]
y = (dfnosouth.regionstring.values=='Sardinia')*1
Xstd= (X -X.mean())/X.std()
Xtrain, Xtest, ytrain, ytest = train_test_split(Xstd.values,y)
clf = SVC(kernel="linear")
clf.fit(Xtrain, ytrain)
plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=ytrain, s=50, cmap='spring', alpha=0.3)
plot_svc_decision_function(clf, plt.gca())
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
                s=200, facecolors='none')
plt.scatter(Xtest[:, 0], Xtest[:, 1], c=ytest, s=50, marker="s", cmap='spring', alpha=0.5);
In [77]:
clf
Out[77]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [78]:
clf.score(Xtest, ytest)
Out[78]:
1.0
In [79]:
thesvcfit.grid_scores_
Out[79]:
[mean: 0.60302, std: 0.00947, params: {'C': 0.0001},
 mean: 0.60302, std: 0.00947, params: {'C': 0.00013894954943731373},
 mean: 0.60302, std: 0.00947, params: {'C': 0.00019306977288832496},
 mean: 0.60302, std: 0.00947, params: {'C': 0.00026826957952797245},
 mean: 0.60302, std: 0.00947, params: {'C': 0.00037275937203149379},
 mean: 0.60302, std: 0.00947, params: {'C': 0.0005179474679231213},
 mean: 0.60302, std: 0.00947, params: {'C': 0.00071968567300115217},
 mean: 0.60302, std: 0.00947, params: {'C': 0.001},
 mean: 0.60302, std: 0.00947, params: {'C': 0.0013894954943731374},
 mean: 0.60302, std: 0.00947, params: {'C': 0.0019306977288832496},
 mean: 0.63819, std: 0.03511, params: {'C': 0.0026826957952797246},
 mean: 0.91960, std: 0.03981, params: {'C': 0.0037275937203149379},
 mean: 0.98995, std: 0.02000, params: {'C': 0.0051794746792312128},
 mean: 0.98492, std: 0.02291, params: {'C': 0.0071968567300115137},
 mean: 0.98492, std: 0.02291, params: {'C': 0.01},
 mean: 0.98492, std: 0.02291, params: {'C': 0.013894954943731374},
 mean: 0.98492, std: 0.02291, params: {'C': 0.019306977288832496},
 mean: 0.98995, std: 0.02000, params: {'C': 0.026826957952797246},
 mean: 0.98995, std: 0.02000, params: {'C': 0.037275937203149381},
 mean: 0.99497, std: 0.01500, params: {'C': 0.051794746792312073},
 mean: 1.00000, std: 0.00000, params: {'C': 0.071968567300115138},
 mean: 1.00000, std: 0.00000, params: {'C': 0.10000000000000001},
 mean: 1.00000, std: 0.00000, params: {'C': 0.13894954943731375},
 mean: 1.00000, std: 0.00000, params: {'C': 0.19306977288832497},
 mean: 1.00000, std: 0.00000, params: {'C': 0.26826957952797248},
 mean: 1.00000, std: 0.00000, params: {'C': 0.37275937203149379},
 mean: 1.00000, std: 0.00000, params: {'C': 0.51794746792312074},
 mean: 1.00000, std: 0.00000, params: {'C': 0.71968567300115138},
 mean: 1.00000, std: 0.00000, params: {'C': 1.0},
 mean: 1.00000, std: 0.00000, params: {'C': 1.3894954943731359},
 mean: 1.00000, std: 0.00000, params: {'C': 1.9306977288832496},
 mean: 1.00000, std: 0.00000, params: {'C': 2.6826957952797219},
 mean: 1.00000, std: 0.00000, params: {'C': 3.7275937203149381},
 mean: 1.00000, std: 0.00000, params: {'C': 5.1794746792312125},
 mean: 1.00000, std: 0.00000, params: {'C': 7.1968567300115138},
 mean: 1.00000, std: 0.00000, params: {'C': 10.0},
 mean: 1.00000, std: 0.00000, params: {'C': 13.89495494373136},
 mean: 1.00000, std: 0.00000, params: {'C': 19.306977288832496},
 mean: 1.00000, std: 0.00000, params: {'C': 26.826957952797219},
 mean: 1.00000, std: 0.00000, params: {'C': 37.275937203149383},
 mean: 1.00000, std: 0.00000, params: {'C': 51.794746792312019},
 mean: 1.00000, std: 0.00000, params: {'C': 71.96856730011514},
 mean: 1.00000, std: 0.00000, params: {'C': 100.0},
 mean: 1.00000, std: 0.00000, params: {'C': 138.94954943731361},
 mean: 1.00000, std: 0.00000, params: {'C': 193.06977288832496},
 mean: 1.00000, std: 0.00000, params: {'C': 268.26957952797216},
 mean: 1.00000, std: 0.00000, params: {'C': 372.7593720314938},
 mean: 1.00000, std: 0.00000, params: {'C': 517.94746792312026},
 mean: 1.00000, std: 0.00000, params: {'C': 719.68567300115137},
 mean: 1.00000, std: 0.00000, params: {'C': 1000.0}]
In [80]:
plt.scatter([e[0]['C'] for e in thesvcfit.grid_scores_],[e[1] for e in thesvcfit.grid_scores_])
plt.xscale("log");
plt.xlim([0.0001, 10000]);

And with C = 100000 this becomes essentially a hard-margin fit. (For more insight into all this see http://stats.stackexchange.com/questions/74499/what-is-the-loss-function-of-hard-margin-svm)

In [81]:
#X = dfsouthns[['palmitic', 'palmitoleic']]
#y = (dfsouthns.area.values==3)*1
X=dfnosouth[['linoleic', 'arachidic']]
y = (dfnosouth.regionstring.values=='Sardinia')*1
Xstd= (X -X.mean())/X.std()
Xtrain, Xtest, ytrain, ytest = train_test_split(Xstd.values,y)
clf = SVC(kernel="linear", C=100000)
clf.fit(Xtrain, ytrain)
plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=ytrain, s=50, cmap='spring', alpha=0.3)
plot_svc_decision_function(clf, plt.gca())
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
                s=200, facecolors='none')
plt.scatter(Xtest[:, 0], Xtest[:, 1], c=ytest, s=50, marker="s", cmap='spring', alpha=0.5);
print clf.score(Xtest, ytest)
1.0

In [82]:
Image("http://scikit-learn.org/dev/_static/ml_map.png")
Out[82]:

12. Bring back Sicily: Decision Trees and Random Forests. C-val not needed.

Decision trees are very simple things we are all familiar with. If a problem is multi-dimensional, the tree goes dimension by dimension and makes cuts in the space to create a classifier.
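A minimal illustration of such a cut: a depth-1 tree (a "stump") on made-up one-dimensional data finds a single threshold:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up 1-D data: class 0 below ~5, class 1 above
X = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

# The single cut the tree made in this one-dimensional space
threshold = stump.tree_.threshold[0]
print(threshold)                       # somewhere between 4 and 6
print(stump.predict([[3.5], [6.5]]))
```

With `max_depth` unset, the tree keeps recursing into each side of each cut until the leaves are pure, which is where its tendency to overfit comes from.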

In [83]:
def classify_tree(indf, inacidlist, clon, clf, train_size=0.6):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=indf[clon].values
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=train_size)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data:     %0.2f" % (test_accuracy)
    return clf, Xtrain, ytrain, Xtest, ytest
In [84]:
#from Jake's ESAC notebook
def visualize_tree(estimator, Xtr, ytr, Xte, yte, boundaries=True,
                   xlim=None, ylim=None):
    estimator.fit(Xtr, ytr)

    if xlim is None:
        xlim = (Xtr[:, 0].min() - 0.1, Xtr[:, 0].max() + 0.1)
    if ylim is None:
        ylim = (Xtr[:, 1].min() - 0.1, Xtr[:, 1].max() + 0.1)

    x_min, x_max = xlim
    y_min, y_max = ylim
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='rainbow')
    plt.clim(ytr.min(), ytr.max())

    # Plot also the training points
    plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr, s=50, cmap='rainbow')
    plt.scatter(Xte[:, 0], Xte[:, 1], c=yte, s=50, marker='s', cmap='rainbow')

    plt.axis('off')

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)        
    plt.clim(ytr.min(), ytr.max())
    
    # Plot the decision boundaries
    def plot_boundaries(i, xlim, ylim):
        if i < 0:
            return

        tree = estimator.tree_
        
        if tree.feature[i] == 0:
            plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k')
            plot_boundaries(tree.children_left[i],
                            [xlim[0], tree.threshold[i]], ylim)
            plot_boundaries(tree.children_right[i],
                            [tree.threshold[i], xlim[1]], ylim)
        
        elif tree.feature[i] == 1:
            plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k')
            plot_boundaries(tree.children_left[i], xlim,
                            [ylim[0], tree.threshold[i]])
            plot_boundaries(tree.children_right[i], xlim,
                            [tree.threshold[i], ylim[1]])
            
    if boundaries:
        plot_boundaries(0, plt.xlim(), plt.ylim())
In [85]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
In [86]:
clf = DecisionTreeClassifier()
clf, xtr, ytr, xte, yte = classify_tree(dfsouth, ['palmitic','palmitoleic'], 'area', clf)
visualize_tree(clf, xtr, ytr, xte, yte, boundaries=False)
Accuracy on training data: 0.99
Accuracy on test data:     0.81

One critical byproduct we get from trees is feature importances

In [87]:
zip(['palmitic','palmitoleic'],clf.feature_importances_)
Out[87]:
[('palmitic', 0.2905013381766785), ('palmitoleic', 0.70949866182332166)]

Run this again and again and watch the fit change each time: this cries overfit. So why use it?

In [88]:
clf = DecisionTreeClassifier()
clf, xtr, ytr, xte, yte = classify_tree(dfsouthns, ['palmitic','palmitoleic'], 'area', clf)
visualize_tree(clf, xtr, ytr, xte, yte, boundaries=False)
Accuracy on training data: 1.00
Accuracy on test data:     0.89

We use it to create a random forest from a set of trees. We do two things:

  • Bagging: here we create a bootstrap sample (sampling with replacement) and train each tree on its own sample
  • We choose a random subset of the features for each tree

random forest

When we do this we get trees like the ones seen in the diagram above. What we do now is, for each region of the space (on some suitable grid), find out what result each tree gives us. Then we take a majority vote, like in kNN. This process of resampling and averaging over the trees reduces the overfitting a bit. With only two features the random feature subsets don't make a huge amount of sense, but let's run it anyway; we can see that the overfitting is softened a bit by the averaging:
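The bagging half of this is simple enough to roll by hand (a sketch on made-up data; `RandomForestClassifier` additionally subsamples features at each split, which this sketch omits):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
trees = []
for _ in range(50):
    # bagging: one bootstrap sample with replacement, one deep tree per sample
    idx = rng.randint(0, len(Xtr), len(Xtr))
    trees.append(DecisionTreeClassifier().fit(Xtr[idx], ytr[idx]))

# majority vote over the ensemble, like kNN votes over neighbors
votes = np.array([t.predict(Xte) for t in trees])
yhat = (votes.mean(axis=0) > 0.5).astype(int)

single = DecisionTreeClassifier().fit(Xtr, ytr).score(Xte, yte)
bagged = float((yhat == yte).mean())
print("single tree: %.3f, bagged vote: %.3f" % (single, bagged))
```

The vote smooths out the jagged, overfit boundaries of the individual trees, which is the softening we see in the plots below.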

In [89]:
clf=RandomForestClassifier(n_estimators=100)
clf, Xtrain, ytrain, Xtest, ytest=classify_tree(dfsouthns, ['palmitic','palmitoleic'],'area',clf)
visualize_tree(clf, Xtrain, ytrain, Xtest, ytest, boundaries=False)
Accuracy on training data: 0.99
Accuracy on test data:     0.93

Let's run it on the entire set of features.

In [90]:
clf=RandomForestClassifier(n_estimators=100)
clf, Xtrain, ytrain, Xtest, ytest=classify_tree(dfsouthns, acidlist,'area',clf)
Accuracy on training data: 1.00
Accuracy on test data:     0.95

In [91]:
zip(acidlist,clf.feature_importances_)
Out[91]:
[('palmitic', 0.092863828502520895),
 ('palmitoleic', 0.30272380997266835),
 ('stearic', 0.06250448555175267),
 ('oleic', 0.16984664744969755),
 ('linoleic', 0.28775641089369225),
 ('linolenic', 0.043670925911337768),
 ('arachidic', 0.025792001319977144),
 ('eicosenoic', 0.014841890398353237)]
In [92]:
clf=RandomForestClassifier(n_estimators=100)
clf, Xtrain, ytrain, Xtest, ytest=classify_tree(dfsouth, acidlist,'area',clf)
Accuracy on training data: 1.00
Accuracy on test data:     0.95

In [93]:
zip(acidlist,clf.feature_importances_)
Out[93]:
[('palmitic', 0.095147895379220332),
 ('palmitoleic', 0.21468435069471561),
 ('stearic', 0.086499974783136549),
 ('oleic', 0.20927065191242025),
 ('linoleic', 0.19605824778342462),
 ('linolenic', 0.06828965638553143),
 ('arachidic', 0.054004315498520396),
 ('eicosenoic', 0.076044907563030839)]

Cross-validation is not needed

unless there are weird hyperparameters you want to fix. Remember cross-validation does two things: it guards against overfitting, and it also fits hyperparameters. For random forests, the out-of-bag samples give us the overfitting check for free.
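Why is the out-of-bag (oob) score a fair stand-in for a test set? Each bootstrap sample of n rows leaves out roughly $(1 - 1/n)^n \approx 1/e \approx 37\%$ of the rows, and those left-out rows act as a free test set for that tree. A quick simulated check (572 is just the olives dataset's row count, used here for scale):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 572                               # number of rows, as in the olives data
fracs = []
for _ in range(1000):
    idx = rng.randint(0, n, n)        # one bootstrap sample, with replacement
    n_oob = n - len(np.unique(idx))   # rows that were never drawn
    fracs.append(n_oob / float(n))

print(np.mean(fracs))                 # close to 1/e ~ 0.368
```

Averaging each tree's predictions over only the rows it never saw is what `oob_score_` reports below.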

In [94]:
def classify_tree_on_full(indf, inacidlist, clon, clf):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=indf[clon].values
    clf=clf.fit(X, y)
    training_accuracy = clf.score(X, y)
    test_accuracy = clf.oob_score_
    print "Accuracy on all data: %0.2f" % (training_accuracy)
    print "Accuracy on oob data:     %0.2f" % (test_accuracy)
    return clf, test_accuracy
In [95]:
newclf=RandomForestClassifier(oob_score=True, n_estimators=100)
newclf, _=classify_tree_on_full(dfsouth, acidlist, 'area', newclf)
zip(acidlist,newclf.feature_importances_)
Accuracy on all data: 1.00
Accuracy on oob data:     0.93

Out[95]:
[('palmitic', 0.099832081910762829),
 ('palmitoleic', 0.2190569282572071),
 ('stearic', 0.083836606845841932),
 ('oleic', 0.18777234847078528),
 ('linoleic', 0.21947564744003603),
 ('linolenic', 0.066306550073997564),
 ('arachidic', 0.065469828044749614),
 ('eicosenoic', 0.058250008956619803)]
In [96]:
accus=[]
myrange=range(10, 250, 10)
for i,n in enumerate(myrange):
    newclf=RandomForestClassifier(oob_score=True, n_estimators=n)
    newclf, a=classify_tree_on_full(dfsouth, acidlist, 'area', newclf)
    accus.append(a)
plt.plot(myrange, accus);
Accuracy on all data: 0.99
Accuracy on oob data:     0.89
Accuracy on all data: 1.00
Accuracy on oob data:     0.92
Accuracy on all data: 1.00
Accuracy on oob data:     0.91
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.92
Accuracy on all data: 1.00
Accuracy on oob data:     0.92
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.92
Accuracy on all data: 1.00
Accuracy on oob data:     0.92
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.94
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.94
Accuracy on all data: 1.00
Accuracy on oob data:     0.92
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.93
Accuracy on all data: 1.00
Accuracy on oob data:     0.92
Accuracy on all data: 1.00
Accuracy on oob data:     0.93

//anaconda/lib/python2.7/site-packages/sklearn/ensemble/forest.py:373: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.
  warn("Some inputs do not have OOB scores. "

In [97]:
newclf.oob_score_
Out[97]:
0.92879256965944268

13. Comparisons, hyperparameters, the structure of things, philosophy, and all that

In [98]:
#doing it multi-dimensionally
thesvcfitmd, Xtr, ytr, Xte, yte = get_optim_classifier_svm(dfsouth, acidlist,'area', 3)
#thesvcfit, Xtr, ytr, Xte, yte = get_optim_classifier_binary(dfsouthns, ['palmitic','palmitoleic'],'area', 3)
thesvcfitmd.best_estimator_, thesvcfitmd.best_params_, thesvcfitmd.best_score_
Out[98]:
(SVC(C=71.96856730011514, cache_size=200, class_weight=None, coef0=0.0,
   degree=3, gamma=0.0, kernel='linear', max_iter=-1, probability=True,
   random_state=None, shrinking=True, tol=0.001, verbose=False),
 {'C': 71.96856730011514},
 0.98062015503875966)
In [99]:
thesvcfitmd.grid_scores_#regularization worsens
Out[99]:
[mean: 0.63953, std: 0.00840, params: {'C': 0.0001},
 mean: 0.63953, std: 0.00840, params: {'C': 0.00013894954943731373},
 mean: 0.63953, std: 0.00840, params: {'C': 0.00019306977288832496},
 mean: 0.63953, std: 0.00840, params: {'C': 0.00026826957952797245},
 mean: 0.63953, std: 0.00840, params: {'C': 0.00037275937203149379},
 mean: 0.63953, std: 0.00840, params: {'C': 0.0005179474679231213},
 mean: 0.70543, std: 0.04276, params: {'C': 0.00071968567300115217},
 mean: 0.78295, std: 0.05239, params: {'C': 0.001},
 mean: 0.86047, std: 0.05512, params: {'C': 0.0013894954943731374},
 mean: 0.90698, std: 0.06499, params: {'C': 0.0019306977288832496},
 mean: 0.92636, std: 0.05754, params: {'C': 0.0026826957952797246},
 mean: 0.95349, std: 0.04402, params: {'C': 0.0037275937203149379},
 mean: 0.96512, std: 0.02651, params: {'C': 0.0051794746792312128},
 mean: 0.96512, std: 0.02651, params: {'C': 0.0071968567300115137},
 mean: 0.96512, std: 0.02655, params: {'C': 0.01},
 mean: 0.96512, std: 0.02655, params: {'C': 0.013894954943731374},
 mean: 0.96899, std: 0.02839, params: {'C': 0.019306977288832496},
 mean: 0.96899, std: 0.02839, params: {'C': 0.026826957952797246},
 mean: 0.96512, std: 0.02655, params: {'C': 0.037275937203149381},
 mean: 0.96512, std: 0.02655, params: {'C': 0.051794746792312073},
 mean: 0.96124, std: 0.02945, params: {'C': 0.071968567300115138},
 mean: 0.96899, std: 0.02844, params: {'C': 0.10000000000000001},
 mean: 0.96899, std: 0.02844, params: {'C': 0.13894954943731375},
 mean: 0.96899, std: 0.02839, params: {'C': 0.19306977288832497},
 mean: 0.97287, std: 0.02965, params: {'C': 0.26826957952797248},
 mean: 0.96899, std: 0.03280, params: {'C': 0.37275937203149379},
 mean: 0.97287, std: 0.03386, params: {'C': 0.51794746792312074},
 mean: 0.97287, std: 0.03386, params: {'C': 0.71968567300115138},
 mean: 0.97287, std: 0.03386, params: {'C': 1.0},
 mean: 0.97287, std: 0.03386, params: {'C': 1.3894954943731359},
 mean: 0.96899, std: 0.03280, params: {'C': 1.9306977288832496},
 mean: 0.96512, std: 0.03959, params: {'C': 2.6826957952797219},
 mean: 0.96512, std: 0.03959, params: {'C': 3.7275937203149381},
 mean: 0.96124, std: 0.04163, params: {'C': 5.1794746792312125},
 mean: 0.96124, std: 0.04163, params: {'C': 7.1968567300115138},
 mean: 0.96124, std: 0.04163, params: {'C': 10.0},
 mean: 0.96124, std: 0.04163, params: {'C': 13.89495494373136},
 mean: 0.96124, std: 0.04163, params: {'C': 19.306977288832496},
 mean: 0.96512, std: 0.03987, params: {'C': 26.826957952797219},
 mean: 0.96899, std: 0.03735, params: {'C': 37.275937203149383},
 mean: 0.97287, std: 0.03825, params: {'C': 51.794746792312019},
 mean: 0.98062, std: 0.02506, params: {'C': 71.96856730011514},
 mean: 0.98062, std: 0.02506, params: {'C': 100.0},
 mean: 0.98062, std: 0.02506, params: {'C': 138.94954943731361},
 mean: 0.97674, std: 0.02475, params: {'C': 193.06977288832496},
 mean: 0.97674, std: 0.02475, params: {'C': 268.26957952797216},
 mean: 0.97674, std: 0.02475, params: {'C': 372.7593720314938},
 mean: 0.97674, std: 0.02475, params: {'C': 517.94746792312026},
 mean: 0.97674, std: 0.02475, params: {'C': 719.68567300115137},
 mean: 0.97674, std: 0.02475, params: {'C': 1000.0}]
In [100]:
plt.scatter([e[0]['C'] for e in thesvcfitmd.grid_scores_],[e[1] for e in thesvcfitmd.grid_scores_])
plt.xscale("log");
plt.xlim([0.0001, 10000]);
In [101]:
thesvcfitmd.score(Xte, yte)
Out[101]:
0.96923076923076923
In [102]:
def classify_tree_on_full_binary(indf, inacidlist, clon, clonval, clf):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=(indf[clon].values==clonval)*1
    clf=clf.fit(X, y)
    training_accuracy = clf.score(X, y)
    test_accuracy = clf.oob_score_
    print "Accuracy on all data: %0.2f" % (training_accuracy)
    print "Accuracy on oob data:     %0.2f" % (test_accuracy)
    return clf, test_accuracy
In [103]:
newclf=RandomForestClassifier(oob_score=True, n_estimators=100)
newclf, _=classify_tree_on_full_binary(dfsouth, acidlist, 'area', 3, newclf)
Accuracy on all data: 1.00
Accuracy on oob data:     0.95

Questions

  1. the similarity of the bootstrap and cross-validation
  2. all these hyperparameters: isn't there a better way? Isn't cross-validation for hyperparameters just another fitting layer, which must itself be cross-validated?
  3. issues with this dataset being small.
  4. class imbalances, and other things you must look at
  5. Bayesian methods for model selection, etc.
  6. parametric models that we didn't talk about.